Large Scale GAN Training for High Fidelity Natural Image Synthesis
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 12:07 UTC · model grok-4.3
The pith
Scaling GAN training with orthogonal regularization and a truncation trick yields new state-of-the-art results on class-conditional ImageNet synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors train Generative Adversarial Networks at the largest scale yet attempted and find that orthogonal regularization applied to the generator makes the model amenable to a truncation trick that reduces the variance of its latent input. This enables explicit control over the fidelity-variety trade-off and yields new state-of-the-art class-conditional image synthesis, with BigGANs reaching an Inception Score of 166.5 and a Frechet Inception Distance of 7.4 on 128x128 ImageNet images.
What carries the argument
The truncation trick, which reduces variance of the generator's latent input to improve fidelity at the cost of variety, enabled by orthogonal regularization on the generator weights.
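Per the abstract, the trick reduces the variance of the generator's input by resampling latent components whose magnitude exceeds a threshold. A minimal numpy sketch of that rejection-resampling scheme; the function and parameter names are ours, not the paper's:

```python
import numpy as np

def truncated_z(batch_size, dim, threshold, rng=None):
    """Sample z ~ N(0, I), resampling every component whose magnitude
    exceeds `threshold`. Smaller thresholds shrink latent variance,
    trading sample variety for fidelity."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((batch_size, dim))
    mask = np.abs(z) > threshold
    while mask.any():
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > threshold
    return z
```

Feeding such truncated noise to a generator trained on untruncated noise is the post-training control the review highlights; orthogonal regularization is what keeps the generator well-behaved on these lower-variance inputs.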
If this is right
- Class-conditional image generation at 128x128 resolution reaches substantially higher fidelity than earlier methods.
- Orthogonal regularization stabilizes large-scale GAN training enough to allow simple post-training control of output statistics (a sketch of the regularizer follows this list).
- The same scaling and regularization approach can be applied to other datasets and resolutions while preserving the truncation control.
- Higher metric scores translate into visibly more realistic and varied samples from complex distributions.
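On that regularizer: the paper's variant penalizes only the off-diagonal entries of the filter Gram matrix, R_β(W) = β ‖WᵀW ⊙ (1 − I)‖²_F, so filters decorrelate without their norms being pinned. A minimal PyTorch sketch under the assumption that conv kernels are flattened to one row per output filter; this is our rendering, not the authors' code:

```python
import torch

def ortho_penalty(weight: torch.Tensor, beta: float = 1e-4) -> torch.Tensor:
    """Off-diagonal orthogonal regularization: penalize correlations
    between distinct filters while leaving their norms unconstrained."""
    w = weight.reshape(weight.size(0), -1)  # one row per output filter
    gram = w @ w.t()                        # pairwise filter inner products
    mask = 1.0 - torch.eye(w.size(0), device=w.device, dtype=w.dtype)
    return beta * (gram * mask).pow(2).sum()
```

The penalty is summed over generator weights and added to the GAN loss; the paper reports a small coefficient (β = 1e-4) as the setting that worked at scale.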
Where Pith is reading between the lines
- The truncation mechanism may generalize to unconditional generation or other generative architectures if similar regularization is applied.
- Controlled trade-offs between fidelity and diversity could support targeted use cases such as data augmentation for downstream vision tasks.
- Further increases in model scale combined with the same regularization might continue to improve results until new instabilities appear.
Load-bearing premise
The reported Inception Score and Frechet Inception Distance on standard ImageNet splits reliably reflect human-perceived image quality and diversity.
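For concreteness, the FID in this premise is the Frechet distance between two Gaussians fitted to Inception-pool activations of real and generated images. A self-contained numpy/scipy sketch, assuming the activation means and covariances are precomputed:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)): the FID formula
    applied to Gaussian fits of real vs. generated Inception activations."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real  # drop negligible imaginary parts from sqrtm
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower is better; the premise is that this distance, together with the Inception Score, tracks human judgments of quality and diversity.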
What would settle it
A direct human preference study in which raters consistently judge samples from prior GANs as higher quality than BigGAN samples at the same resolution.
read the original abstract
Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick," allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128x128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4, improving over the previous best IS of 52.52 and FID of 18.6.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BigGAN, a class-conditional GAN trained at unprecedented scale on ImageNet. Architectural scaling, orthogonal regularization applied to the generator, and a truncation trick on the latent input are introduced to mitigate scale-induced instabilities; the resulting models report a new state-of-the-art Inception Score of 166.5 and an FID of 7.4 at 128×128 resolution, improving on prior bests of 52.52 and 18.6.
Significance. If the empirical results hold, the work demonstrates that targeted scaling combined with simple stabilization techniques can substantially advance high-fidelity, diverse image synthesis on complex datasets, providing both new benchmarks and practical insights into training dynamics at large scale. The ablations on regularization and truncation add concrete value for the community.
major comments (2)
- Experiments section (Table 1 and associated text): the IS of 166.5 and FID of 7.4 are reported from single training runs, without error bars, standard deviations, or results across multiple random seeds; given the well-known stochasticity of GAN optimization, this weakens the strict SOTA claim.
- §3.2 (truncation trick): the quantitative trade-off curves between fidelity and diversity are shown only for a single truncation threshold schedule; additional analysis of sensitivity to the precise threshold value, and of its interaction with the orthogonal regularization term, is needed to confirm the robustness of the reported gains.
minor comments (2)
- [Abstract and §4] The abstract and §4 could explicitly state the total parameter count and approximate compute (GPU-hours) used for the largest models to contextualize the 'largest scale yet attempted' claim.
- [Figures 3-5] Figure captions for generated samples should include the exact truncation threshold and class-conditioning details used for each panel to improve reproducibility of the visual results.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation and recommendation for minor revision. We address the major comments point by point below, with proposed revisions where appropriate.
read point-by-point responses
- Referee: Experiments section (Table 1 and associated text): the IS of 166.5 and FID of 7.4 are reported from single training runs, without error bars, standard deviations, or results across multiple random seeds; given the well-known stochasticity of GAN optimization, this weakens the strict SOTA claim.
  Authors: We acknowledge that reporting results from single training runs limits the ability to quantify variability, and that multiple seeds would provide stronger evidence for the SOTA claims. Training at this scale is extremely resource-intensive, which constrained our ability to run extensive replicates. However, the reported gains (IS from 52.52 to 166.5, FID from 18.6 to 7.4) are large relative to prior work, and our ablations demonstrate consistent benefits from the proposed techniques. In the revised manuscript we will add a paragraph in the Experiments section explicitly noting the single-run limitation and the potential for stochastic variation in GAN training.
  Revision: partial.
- Referee: §3.2 (truncation trick): the quantitative trade-off curves between fidelity and diversity are shown only for a single truncation threshold schedule; additional analysis of sensitivity to the precise threshold value, and of its interaction with the orthogonal regularization term, is needed to confirm the robustness of the reported gains.
  Authors: We agree that further analysis would strengthen the presentation of the truncation trick. The curves in §3.2 use the schedule we found most effective after initial tuning, but we can expand the section to include additional threshold values and a short discussion of how orthogonal regularization interacts with truncation (the regularization stabilizes the generator under reduced-variance inputs). We will revise §3.2 to incorporate sensitivity plots and a brief robustness note.
  Revision: yes.
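The sensitivity question has a simple quantitative core: the truncation threshold directly determines how much latent variance survives. A self-contained scipy sketch of that relationship (the threshold values here are illustrative, not the paper's schedule):

```python
from scipy.stats import truncnorm

# Standard deviation of a standard normal truncated to [-t, t]:
# the quantity the truncation threshold controls in §3.2.
for t in (0.3, 0.5, 1.0, 1.5, 2.0):
    print(f"threshold={t:.1f}  latent std={truncnorm.std(-t, t):.3f}")
```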
Circularity Check
No significant circularity
full rationale
The paper is an empirical study of large-scale GAN training rather than a derivation of new theoretical results. It reports experimental outcomes from training BigGAN models on ImageNet, including measured IS of 166.5 and FID of 7.4, supported by ablations on orthogonal regularization and the truncation trick. These metrics are standard, externally defined, and evaluated on fixed dataset splits independent of the paper's own equations or fitted parameters. No load-bearing step reduces by construction to self-definition, fitted inputs renamed as predictions, or self-citation chains; the central claims rest on reproducible training runs and standard evaluation protocols.
Axiom & Free-Parameter Ledger
free parameters (2)
- truncation threshold
- generator and discriminator channel multipliers
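For illustration, the two ledger entries are the kind of knobs that would surface in a training or sampling configuration; the names and values below are hypothetical, not taken from the authors' code:

```python
# Hypothetical configuration; names and values are illustrative only.
config = {
    "truncation_threshold": 0.5,  # latent truncation applied at sampling time
    "channel_multiplier": 96,     # width of G and D blocks (capacity knob)
}
```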
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (tagged: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem. Linked passage:
  "Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128x128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.5 and Frechet Inception Distance (FID) of 7.4"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
- Prompt-to-Prompt Image Editing with Cross Attention Control
  Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
- One-Step Generative Modeling via Wasserstein Gradient Flows
  W-Flow achieves state-of-the-art one-step ImageNet 256x256 generation at 1.29 FID by training a static neural network to follow a Wasserstein gradient flow that minimizes Sinkhorn divergence, delivering roughly 100x f...
- SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
  SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
- Less is More: Recursive Reasoning with Tiny Networks
  TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
- One Step Diffusion via Shortcut Models
  Shortcut models enable high-quality single or few-step sampling in diffusion models with one network and training phase by conditioning on desired step size.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
  A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.
- Diffusion Models Beat GANs on Image Synthesis
  Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
- Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection
  SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.
- Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
  Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
- Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
  Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
- Reference-based Category Discovery: Unsupervised Object Detection with Category Awareness
  RefCD enables unsupervised category-aware object detection by using feature similarity between predicted objects and unlabeled reference images to guide category learning.
- Intermediate Representations are Strong AI-Generated Image Detectors
  Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.
- Efficient Diffusion Distillation via Embedding Loss
  Embedding Loss aligns feature distributions via MMD in random network embeddings to boost one-step diffusion distillation, reaching SOTA FID of 1.475 on CIFAR-10 unconditional generation.
- Pairing Regularization for Mitigating Many-to-One Collapse in GANs
  Pairing regularization mitigates intra-mode collapse in GANs by penalizing redundant latent-to-sample mappings, improving recall under collapse-prone conditions or precision under stabilized training.
- Frequency-Aware Flow Matching for High-Quality Image Generation
  FreqFlow introduces frequency-aware conditioning and a two-branch architecture to flow matching, reaching FID 1.38 on ImageNet-256 and outperforming DiT and SiT.
- ELT: Elastic Looped Transformers for Visual Generation
  Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
- Multimodal Large Language Models for Multi-Subject In-Context Image Generation
  MUSIC is the first MLLM for multi-subject in-context image generation that uses an automatic data pipeline, vision chain-of-thought reasoning, and semantics-driven spatial layout planning to outperform prior methods o...
- MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
  MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- VideoGPT: Video Generation using VQ-VAE and Transformers
  VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
- Information theoretic underpinning of self-supervised learning by clustering
  SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
- Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
  MDMF detects AI-generated images by learning patch-level forensic signatures and quantifying their distributional discrepancies with MMD, yielding larger separation than global methods when micro-defects are present.
- Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
  Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
- Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection
  FGINet uses a band-masked frequency encoder and layer-wise gated injection to fuse frequency artifacts with vision foundation model semantics, plus hyperspherical compactness learning, to achieve better generalization...
- Adaptive Forensic Feature Refinement via Intrinsic Importance Perception
  I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...
Reference graph
Works this paper leans on
- [1] Shane Barratt and Rishi Sharma. A note on the Inception Score. arXiv preprint arXiv:1801.01973, 2018.
- [2] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.
- [3] William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In ICLR, 2018.
- [4] Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.
- [5] Jae Hyun Lim and Jong Chul Ye. Geometric GAN. arXiv preprint arXiv:1705.02894, 2017.
- [6]
- [7] Marco Marchesi. Megapixel size image creation using generative adversarial networks. arXiv preprint arXiv:1706.00082, 2017.
- [8] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- [9] Augustus Odena, Jacob Buckman, Catherine Olsson, Tom B. Brown, Christopher Olah, Colin Raffel, and Ian Goodfellow. Is generator conditioning causally related to GAN performance? In ICML, 2018.
- [10] Mathijs Pieters and Marco Wiering. Comparing generative adversarial network techniques for image creation and modification. arXiv preprint arXiv:1803.09093, 2018.
- [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929-1958, 2014.
- [12] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
- [13] Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training. arXiv preprint arXiv:1806.04498, 2018.
- [14] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.